Introduction

Discourse markers are a linguistic device that speakers use to
signal how the upcoming unit of speech or text relates to the
current discourse state (Schiffrin 1987). Previous work in
computational linguistics has emphasized their role in marking
changes in the global discourse structure
(e.g. (Grosz & Sidner 1986; Reichman 1985



For instance, "by the way" is used to mark 
gression, "anyway" to mark the return from 
to shift to a new topic. Schiffrin's work in 
1983), so it is not clear whether discourse markers are playing
the same role in task-oriented dialogs as in other forms of
discourse.

One problem with discourse markers, however, is that there
is ambiguity as to whether lexical items are functioning as
discourse markers. Consider the lexical item "so". Not
only can it be used as a discourse marker to introduce an
utterance, but it can also be used sententially to indicate a
subordinating clause, as illustrated by the following example
from the Trains corpus.

Example 1 (d93-15.2 utt9) 

an hour to load them 
you know 

Discourse markers can also be used inside an utterance to
mark a speech repair, where the speaker goes back and repeats
or corrects something she just said. Here, the discourse
markers play a much more internal role, as the following
example with "well" illustrates.

Example 2 (d93-26.3 utt12)

lave engine well if I take engine one and pick up a boxcar 
Due to these difficulties, an effective algorithm for
identifying discourse markers in spontaneous speech needs to
also address the problem of segmenting speech into utterance
units and identifying speech repairs (Heeman & Allen 1997b).

In the rest of this paper, we first review the Trains corpus
and the manner in which the discourse markers were
annotated by using special part-of-speech (POS) tags to
denote them. We then examine the role that discourse
markers play in task-oriented dialogs. We then present our
speech recognition language model, which incorporates
POS tagging, and thus discourse marker identification.
We show that distinguishing discourse marker usages
results in improved language modeling. We also show that
discourse marker identification is improved by modeling
interactions with utterance segmentation and resolving
speech repairs. From this, we conclude that discourse
markers can be used by hearers to set up expectations of
the role that the upcoming utterance plays in the dialog.
Due to the ability to automatically identify discourse markers
during the speech recognition process, we argue that they
can be exploited in the task of dialog act identification,
which is currently receiving much attention in spontaneous
speech research (e.g. (Taylor et al. 1997; Chu-Carroll 1998;
Stolcke et al. 1998)). We conclude with a comparison to
the method proposed by Litman (1996) for identifying
discourse markers.

Trains Corpus 



As part of the Trains project (Allen et al. 1995), which is a
long term research project to build a conversationally
proficient planning assistant, we have collected a corpus of
problem solving dialogs (Heeman & Allen 1995). The dialogs
involve two human participants, one who is playing the role
of a user and has a certain task to accomplish, and another
who is playing the role of the system by acting as a planning
assistant. The collection methodology was designed
to make the setting as close to human-computer interaction
as possible, but was not a wizard scenario, where one person
pretends to be a computer; rather, the user knows that
he is talking to another person. The Trains corpus consists
of approximately six and a half hours of speech. Table 1
gives some general statistics about the corpus, including the
number of dialogs, speakers, words, speaker turns, and
occurrences of discourse markers.



Dialogs             98
Speakers            34
Words               58298
Turns               6163
Discourse Markers   8278

Table 1: Size of the Trains Corpus

Our strategy for annotating discourse markers is to mark 
such usages with special POS tags. Four special POS tags 
were added to the Penn Treebank tagset (Marcus, Santorini, 
& Marcinkiewicz 1993) to denote discourse marker usage. 
These tags are defined in Table 2.¹ Verbs used as discourse

AC: Single word acknowledgments, such as "okay", 
"right", "mm-hm", "yeah", "yes", "alright", "no", and 
"yep". 

UH_D: Interjections with discourse purpose, such as
"oh", "well", "hm", "mm", and "like".

CC_D: Co-ordinating conjuncts used as discourse mark- 
ers, such as "and", "so", "but", "oh", and "because". 

RB_D: Adverbials used as discourse markers, such as 
"then", "now", "actually", "first", and "anyway". 

Table 2: POS tags for Discourse Markers 

markers, such as "wait" and "see", are not given special
markers, but are annotated as verbs. Also, no attempt has
been made at analyzing multi-word discourse markers, such
as "by the way" and "you know". However, phrases such
as "oh really" and "and then" are treated as two individual
discourse markers. Lastly, filled pause words, namely "uh",
"um" and "er", are marked with UH_FP, but these are not
considered as discourse markers.

POS-Based Language Model 

The traditional goal of speech recognition is to find the
sequence of words W that is most probable given the acoustic
signal A. In earlier work (Heeman & Allen 1997a;
Heeman 1997), we argue that this view is too limiting. In a
spoken dialog system, word recognition is just the first step 
in understanding the speaker's turn. Furthermore, speech 
recognition is difficult especially without the use of higher 
level information. Hence, we propose as a first step to in- 
corporate POS tagging into the speech recognition process. 

Previous approaches that have made use of POS tags 
in speech recognition view the POS tags as intermediate 
objects by summing over the POS tag sequences (Jelinek 
1985). Instead, we take the approach of redefining the goal 
of the speech recognition process so that it finds the best 
word (W) and POS tag (P) sequence given the acoustic 
signal. The derivation of the acoustic model and language 
model is now as follows. 



$$
\begin{aligned}
\hat{W}\hat{P} &= \arg\max_{W,P} \Pr(WP \mid A) \\
               &= \arg\max_{W,P} \frac{\Pr(A \mid WP)\,\Pr(WP)}{\Pr(A)} \\
               &= \arg\max_{W,P} \Pr(A \mid WP)\,\Pr(WP)
\end{aligned}
$$

The first term $\Pr(A \mid WP)$ is the factor due to the acoustic
model, which we can approximate by $\Pr(A \mid W)$. The second
term $\Pr(WP)$ is the factor due to the language model.
We rewrite $\Pr(WP)$ as $\Pr(W_{1,N} P_{1,N})$, where $N$ is the
number of words in the sequence. We now rewrite the language
model probability as follows.

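$$
\begin{aligned}
\Pr(W_{1,N} P_{1,N})
  &= \prod_{i=1}^{N} \Pr(W_i P_i \mid W_{1,i-1} P_{1,i-1}) \\
  &= \prod_{i=1}^{N} \Pr(W_i \mid W_{1,i-1} P_{1,i})\,\Pr(P_i \mid W_{1,i-1} P_{1,i-1})
\end{aligned}
$$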



The final probability distributions are similar to those used
by previous attempts to use POS tags in language modeling
(Jelinek 1985) and those used for POS tagging of written
text (Charniak et al. 1993; Church 1988; DeRose 1988).
However, these approaches simplify the probability
distributions as shown by the approximations below.



1 Other additions to the tagset are described in Heeman (1997).



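$$
\begin{aligned}
\Pr(W_i \mid W_{1,i-1} P_{1,i})   &\approx \Pr(W_i \mid P_i) \\
\Pr(P_i \mid W_{1,i-1} P_{1,i-1}) &\approx \Pr(P_i \mid P_{1,i-1})
\end{aligned}
$$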



However, as we have shown in earlier work (Heeman &
Allen 1997a; Heeman 1997), such simplifications lead to
poor language models.

Probability Distributions 

We have two probability distributions that need to be es- 
timated. The simplest approach for estimating the prob- 
ability of an event given a context is to use the relative 
frequency that the event occurs given the context accord- 
ing to a training corpus. However, no matter how large the 
training corpus is, there will always be event-context pairs 
that have not been seen or that have been seen too rarely to 
accurately estimate the probability. To alleviate this prob- 
lem, one can partition the contexts into a smaller number 
of equivalence classes and use these equivalence classes to 
compute the relative frequencies. 
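
As a rough illustration of relative-frequency estimation over equivalence classes, the sketch below counts events per context class and normalizes. The function name, the toy samples, and the choice of "previous POS tag" as the equivalence class are all assumptions made for the example; they are not taken from the model described in this paper.

```python
from collections import Counter, defaultdict


def relative_frequencies(samples, equivalence_class):
    """Estimate Pr(event | class(context)) by relative frequency.

    samples: iterable of (context, event) pairs from a training corpus.
    equivalence_class: function mapping a context to a hashable class label.
    """
    counts = defaultdict(Counter)
    for context, event in samples:
        counts[equivalence_class(context)][event] += 1

    probs = {}
    for label, event_counts in counts.items():
        total = sum(event_counts.values())
        probs[label] = {e: c / total for e, c in event_counts.items()}
    return probs


# Toy data invented for illustration: context = (previous word, previous tag),
# event = POS tag of the current word.
training = [
    (("okay", "AC"), "CC_D"),
    (("right", "AC"), "CC_D"),
    (("okay", "AC"), "RB_D"),
    (("the", "DT"), "NN"),
]

# Hypothetical equivalence class: keep only the previous POS tag, so that
# "okay" and "right" contexts share their counts.
by_prev_tag = relative_frequencies(training, lambda ctx: ctx[1])
```

Collapsing contexts this way trades specificity for more reliable counts; the decision tree described next chooses such partitions automatically instead of by hand.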

We use a decision tree learning algorithm (Bahl et al.
1989; Black et al. 1992; Breiman et al. 1984), which
uses information theoretic measures to construct equivalence
classes of the context in order to cope with sparseness
of data. The decision tree algorithm starts with all of
the training data in a single leaf node. For each leaf node, it 
looks for the question to ask of the context such that split- 
ting the node into two leaf nodes results in the biggest de- 
crease in impurity, where the impurity measures how well 
each leaf predicts the events in the node. Heldout data is 
used to decide when to stop growing the tree: a split is re- 
jected if the split does not result in a decrease in impurity 
with respect to the heldout data. After the tree is grown, the 
heldout dataset is used to smooth the probabilities of each 
node with its parent (Bahl et al. 1989). 
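
The split-selection step can be sketched as follows. This is a simplified illustration, not the algorithm of Bahl et al.: it assumes entropy as the impurity measure, represents questions as named yes/no functions, and checks held-out data by recomputing entropy on the held-out events themselves rather than scoring them against the training-derived leaf distributions. All names and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(events):
    """Entropy in bits of the empirical event distribution (0.0 for no data)."""
    total = len(events)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in Counter(events).values())


def split_impurity(samples, question):
    """Weighted entropy of the two leaves produced by a yes/no question."""
    yes = [e for ctx, e in samples if question(ctx)]
    no = [e for ctx, e in samples if not question(ctx)]
    n = len(samples)
    return (len(yes) * entropy(yes) + len(no) * entropy(no)) / n if n else 0.0


def best_split(train, heldout, questions):
    """Pick the question with the largest impurity decrease on the training data.

    Returns its name, or None if the split fails to lower impurity on the
    training data or on the held-out data (the stopping criterion).
    """
    name, question = min(questions.items(), key=lambda kv: split_impurity(train, kv[1]))
    if split_impurity(train, question) >= entropy([e for _, e in train]):
        return None
    if split_impurity(heldout, question) >= entropy([e for _, e in heldout]):
        return None
    return name


# Toy data invented for illustration.
train = [({"prev_tag": "AC"}, "CC_D"), ({"prev_tag": "AC"}, "CC_D"),
         ({"prev_tag": "DT"}, "NN"), ({"prev_tag": "DT"}, "NN")]
heldout = [({"prev_tag": "AC"}, "CC_D"), ({"prev_tag": "DT"}, "NN")]
questions = {
    "prev_tag_is_AC": lambda ctx: ctx["prev_tag"] == "AC",
    "always_true": lambda ctx: True,
}
chosen = best_split(train, heldout, questions)
```

Growing a full tree would apply `best_split` recursively to each new leaf until every candidate split is rejected; smoothing each node with its parent is a separate step not shown here.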

To allow the decision tree to ask questions about the
words and POS tags in the context such that the questions
can generalize about words and POS tags that behave
similarly, we cluster the words and POS tags using the
algorithm of Brown et al. (1992) into a binary classification tree.
The algorithm starts with each word (or POS tag) in a separate
class, and successively merges the classes whose merger results in
the smallest loss in mutual information in terms of the
co-occurrences of these classes. By keeping track of the order
in which classes were merged, we can construct a hierarchical
classification of the classes. Figure 1 shows a POS
classification tree, which was automatically built from the training
data. Note that the classification algorithm has clustered
the discourse marker POS tags close to each other in the
classification tree.
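
The merging procedure can be illustrated with a deliberately naive sketch. It recomputes the average mutual information of adjacent classes from scratch for every candidate merge, which is far slower than the bookkeeping used by Brown et al. (1992), and it works on a toy token sequence invented for the example. The function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(bigrams, assignment):
    """Average mutual information between adjacent classes under a token-to-class map."""
    pair_counts = Counter((assignment[a], assignment[b]) for a, b in bigrams)
    total = sum(pair_counts.values())
    left, right = Counter(), Counter()
    for (c1, c2), n in pair_counts.items():
        left[c1] += n
        right[c2] += n
    return sum(
        (n / total) * math.log2(n * total / (left[c1] * right[c2]))
        for (c1, c2), n in pair_counts.items()
    )


def greedy_merges(tokens):
    """Repeatedly merge the pair of classes that loses the least mutual information.

    Returns the merges in the order they were made; that order defines the
    hierarchy. A merge (a, b) means class b was folded into class a.
    """
    bigrams = list(zip(tokens, tokens[1:]))
    assignment = {t: t for t in set(tokens)}  # each token starts in its own class
    merges = []
    while len(set(assignment.values())) > 1:
        classes = sorted(set(assignment.values()))

        def information_after(pair):
            a, b = pair
            trial = {t: (a if c == b else c) for t, c in assignment.items()}
            return mutual_information(bigrams, trial)

        a, b = max(combinations(classes, 2), key=information_after)
        assignment = {t: (a if c == b else c) for t, c in assignment.items()}
        merges.append((a, b))
    return merges


# Toy sequence: "okay" and "alright" occur in identical contexts.
merges = greedy_merges("okay so alright so okay so alright so".split())
```

In this toy sequence the two acknowledgment-like tokens are merged first, because folding them together loses the least information about what follows and precedes them.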

The binary classification tree gives an implicit binary
encoding for each POS tag, which is determined by the
sequence of top and bottom edges that leads from the root
node to the node for the POS tag. The binary encoding
allows the decision tree to ask about the words and POS tags
using simple binary questions, such as 'is the third bit of
the POS tag encoding equal to one?'
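
A small sketch may make the encoding concrete. It assumes the tree is given as nested pairs, that a top edge contributes a 1 and a bottom edge a 0, and it uses a miniature hand-made tree; the tree in Figure 1 is learned from data and is not reproduced here.

```python
def bit_encodings(tree, prefix=""):
    """Map each leaf of a nested (top, bottom) tuple tree to its bit string.

    Taking the top edge appends '1' and the bottom edge '0'
    (an assumed convention for this illustration).
    """
    if isinstance(tree, str):
        return {tree: prefix}
    top, bottom = tree
    codes = bit_encodings(top, prefix + "1")
    codes.update(bit_encodings(bottom, prefix + "0"))
    return codes


def bit_is_one(code, position):
    """Answer 'is bit number `position` (1-based) of the encoding equal to one?'

    Encodings shorter than `position` answer False.
    """
    return len(code) >= position and code[position - 1] == "1"


# Hypothetical miniature tree with the discourse marker tags grouped together.
toy_tree = ((("AC", "UH_D"), ("CC_D", "RB_D")), ("NN", "VB"))
codes = bit_encodings(toy_tree)
```

Because tags that cluster together share a prefix, a question about an early bit separates whole groups of tags (here, all four discourse marker tags begin with 1), while later bits make finer distinctions.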




Figure 1: POS Classification Tree



Unlike other work (e.g. (Black et al. 1992; Magerman
1995)), we treat the word identities as a further refinement
of the POS tags; thus we build a word classification tree 
for each POS tag. We grow the classification tree by start- 
ing with a unique class for each word and each POS tag 
that it takes on. When we merge classes to form the hi- 
erarchy, we only allow merges if all of the words in both 
classes have the same POS tag. The result is a word clas- 
sification tree for each POS tag. This approach of building 
a word classification tree for each POS tag has the advan- 
tage that it better deals with words that can take on multiple 
senses, such as the word "loads", which can be a plural 
noun (NNS) or a present tense third-person verb (VBZ). As 
well, it constrains the task of building the word classifica- 
tion trees since the major distinctions are captured by the 
POS classification tree, thus allowing us to build classification
trees even for small corpora. Figure 2 gives the classification
tree for the acknowledgments (AC). For each word,
we give the number of times that it occurred in the training
data. Words that only occurred once in the training corpus
have been grouped together in the class '!unknown'.
Although the clustering algorithm was able to group some
of the similar acknowledgments with each other, such as
the group of "mm-hm" and "uh-huh" and the group of "good",
"great", and "fine", other similar words were not grouped
together, such as "yep" with "yes" and "yeah", and "no"
with "nope". Word adjacency information is insufficient
for capturing such semantic information.




!unknown 1
fine 4
exactly 6
good 13
great 14
sorry 14
alright 155
okay 1700
hello 71
hi
yeah 185
yes 194
no 128
nope 5
sure 9
correct 13
yep 108
mm-hm 246
uh-huh 30
right 434

Figure 2: AC Classification Tree

Results 

To demonstrate our model, we use a 6-fold cross validation 
procedure, in which we use each sixth of the corpus for 
testing data, and the rest for training data. We start with the 
word transcriptions of the Trains corpus, thus allowing us 
to get a clearer indication of the performance of our model 
without having to take into account the poor performance 
of speech recognizers on spontaneous speech. 
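
The cross-validation procedure can be sketched as below. The paper does not say how the corpus was divided into sixths, so the round-robin assignment by position and the function name are assumptions made only for this illustration.

```python
def six_fold_splits(items, folds=6):
    """Yield (train, test) lists so that each fold is used for testing exactly once.

    Items are assigned to folds round-robin by position (an assumed scheme).
    """
    for k in range(folds):
        test = [x for i, x in enumerate(items) if i % folds == k]
        train = [x for i, x in enumerate(items) if i % folds != k]
        yield train, test


# Usage with placeholder dialog identifiers.
splits = list(six_fold_splits([f"dialog{i}" for i in range(12)]))
```

Each item appears in exactly one test set, so results summed over the six runs cover the whole corpus without ever testing on training material.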

Table 3 reports the results of explicitly modeling discourse
markers with special POS tags. The second column,
"No DM", reports the results of collapsing the discourse
marker usages with the sentential usages. Thus, the discourse
conjunct CC_D is collapsed into CC, the discourse
adverbial RB_D is collapsed into RB, and the acknowledgment
AC and discourse interjection UH_D are collapsed
into UH_FP. The third column gives the results of the
model that does distinguish discourse marker usages, but
ignores POS errors due to miscategorizing words as being
discourse markers or not. We see that modeling discourse
markers results in a reduction of POS errors from 1219 to
1189, giving a POS error rate of 2.04%. We also see a small
decrease in perplexity from 24.20 to 24.04. Perplexity of a
test set of $N$ words $w_{1,N}$ is calculated as follows.



$$
2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 \Pr(w_i \mid w_{1,i-1})}
$$
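
The perplexity calculation is short enough to write out directly. The sketch assumes the per-word conditional probabilities have already been produced by some language model; the function name and the example values are made up.

```python
import math


def perplexity(word_probs):
    """Perplexity of a test set given Pr(w_i | w_1 .. w_{i-1}) for each word i."""
    n = len(word_probs)
    log_sum = sum(math.log2(p) for p in word_probs)
    return 2 ** (-log_sum / n)


# A model that assigns probability 1/4 to every test word has perplexity 4:
# it is as uncertain as a uniform choice among four words.
uniform_four = perplexity([0.25, 0.25, 0.25, 0.25])
```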



In previous work (Heeman & Allen 1997b; Heeman
1997), we argued that discourse marker identification is





                    No DM     DM
POS Errors          1219      1189
POS Error Rate (%)  2.09      2.04
Perplexity          24.20     24.04





                      Base          Tones         Tones
                      Model         Repairs       Repairs
                                    Corrections   Corrections
                                                  Silences
POS Tagging
  Errors              1711          1652          1572
  Error Rate (%)      2.93          2.83          2.69
Perplexity            24.04         22.96         22.35
Discourse Markers
  Errors              630           611           533
  Recall (%)          96.75         96.67         97.26
  Precision (%)       95.68         95.97         96.32

Table 4: POS Tagging and Perplexity Results



tightly intertwined with the problems of intonational phrase
identification and resolving speech repairs. These three
tasks, we claim, are necessary in order to understand the
user's contributions. In Table 4, we show how discourse
marker identification, POS tagging and perplexity benefit
by modeling the speaker's utterance. The second column
gives the results of the POS-based model, which was used
in the third column of Table 3; the third column gives the
results of incorporating the detection and correction of speech
repairs and detection of intonational phrase boundary tones;
and the fourth column gives the results of adding in silence
information to give further evidence as to whether a speech
repair or boundary tone occurred. As can be seen, modeling
the user's utterances improves POS tagging and word
perplexity; adding in silence information to help detect speech
repairs and intonational boundaries further improves these
two rates.² Of concern to this paper, we also see an
improvement in the identification of discourse markers,
improving from 630 to 533 errors. This gives a final recall rate
of 97.26% and a precision of 96.32%.³ In Heeman (1997), we
also show that modeling discourse markers improves the
detection of speech repairs and intonational boundaries.
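
For reference, recall and precision as defined in footnote 3 can be computed as in the sketch below. The counts in the usage line are invented for illustration and are not the counts behind Table 4.

```python
def recall_precision(correct, actual, guessed):
    """Recall and precision for discourse marker identification.

    correct: discourse markers that were correctly identified
    actual:  discourse markers actually present in the test data
    guessed: tokens the model labeled as discourse markers
    """
    return correct / actual, correct / guessed


# Invented counts: 90 of 100 true markers found, with 120 tokens labeled as markers.
recall, precision = recall_precision(correct=90, actual=100, guessed=120)
```

Recall penalizes missed markers and precision penalizes false alarms, which is why both are reported alongside the raw error count.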



Comparison to Other Work 



Hirschberg and Litman (1993) examined how intonational
information can distinguish between the discourse and sen-
tential interpretation for a set of ambiguous lexical items. 
This work was based on hand-transcribed intonational fea- 
tures and examined discourse markers that were one word 
long. In an initial study of the discourse marker "now", 
they found that discourse usages of the word "now" were 
either an intermediate phrase by themselves (or in a phrase 
consisting entirely of ambiguous tokens), or they are first in 



Table 3: Discourse Markers and Perplexity 



2 Note the POS results include errors due to miscategorizing
discourse markers, which were excluded from the POS results
reported in Table 3.

3 The recall rate is the number of discourse markers that were 
correctly identified over the actual number of discourse markers. 
The precision rate is the number of correctly identified discourse 
markers over the total number of discourse markers guessed. 



an intermediate phrase (or preceded by other ambiguous tokens)
and are either de-accented or have a low accent (L*).
Sentential uses were either non-initial in a phrase or, if first,
bore a high (H*) or complex accent (i.e. not a L* accent).
In a second study, Hirschberg and Litman used a speech
consisting of approximately 12,500 words. They found that
the intonational model that they had proposed for the
discourse marker "now" achieved a recall rate of 63.1% of the
discourse markers with a precision of 88.3%.⁴

Hirschberg and Litman also looked at the effect of or- 
thographic markers and POS tags. For the orthographic 
markings, they looked at how well discourse markers can 
be predicted based on whether they follow or precede a 
hand-annotated punctuation mark. They also examined
correlations with POS tags. For this experiment, rather than
define special POS tags as we have done, they chose discourse
marker interpretation versus sentential interpretation
based on whichever is more likely for that POS tag, where
the POS tags were automatically computed using Church's
part-of-speech tagger (1988). This gives them a recall rate
of 39.0% and a precision of 55.2%.



Litman (1996) explored using machine learning techniques
to automatically learn classification rules for discourse
markers. She contrasted the performance of
CGRENDEL (Cohen 1992; Cohen 1993) with C4.5 (Quinlan
1993). CGRENDEL is a learning algorithm that learns
an ordered set of if-then rules that map a condition to its 
most-likely event (in this case discourse or sentential in- 
terpretation of potential discourse marker). C4.5 is a deci- 
sion tree growing algorithm that learns a hierarchical set of 
if-then rules in which the leaf nodes specify the mapping 
to the most-likely event. She found that machine learn- 
ing techniques could be used to learn a classification al- 
gorithm that was as good as the algorithm manually built 
by Hirschberg and Litman (1993). Further improvements
were obtained when different sets of features about the
context were explored, such as the identity of the token under
consideration. The best results (although the differences be- 
tween this version and some of the others might not be sig- 
nificant) were obtained by using CGRENDEL and letting 
it choose conditions from the following set: length of in- 
tonational phrase, position of token in intonational phrase, 
length of intermediate phrase, position of token in interme- 
diate phrase, composition of intermediate phrase (token is 
alone in intermediate phrase, phrase consists entirely of po- 
tential discourse markers, or otherwise), and identity of po- 
tential discourse marker. The automatically derived classi- 
fication algorithm achieved a success rate of 85.5%, which 
translates into a discourse marker error rate of 37.3%, in 
comparison to the error rate of 45.3% for the algorithm of 
Hirschberg and Litman (1993). Hence, machine learning
techniques are an effective way in which a number of
different sources of information can be combined to identify
discourse markers. 

Direct comparisons with our results are problematic 
since our corpus is approximately five times as large. Also 
we use task-oriented human-human dialogs, rather than a 
monologue, and hence our corpus includes a lot of turn- 
initial discourse markers for co-ordinating mutual belief. 
However, our results are based on automatically identifying 
intonational boundaries, rather than including these as part 
of the input. In any event, the work of Litman and the earlier 
work with Hirschberg indicate that our results can be fur- 
ther improved by also modeling intermediate phrase bound- 
aries (phrase accents), and word accents, and by improving 
our modeling of these events, perhaps by using more acous- 
tic cues. Conversely, we feel that our approach, which inte- 
grates discourse marker identification with speech recogni- 
tion along with POS tagging, boundary tone identification 
and the resolution of speech repairs, allows different inter- 
pretations to be explored in parallel, rather than forcing in- 
dividual decisions to be made about each ambiguous token. 
This allows interactions between these problems to be mod- 
eled, which we feel accounts for some of the improvement 
between our results and the results reported by Litman. 

Predicting Speech Acts 

Discourse markers are a prominent feature of human- 
human task-oriented dialogs. In this section, we exam- 
ine the role that discourse markers, other than acknowl- 
edgments, play at the beginning of speaker turns and show 
that discourse markers can be used by the hearer to set up 
expectations of the role that the upcoming utterance plays 
in the dialog. Table 5 gives the number of occurrences of 
discourse markers in turn initial position in the Trains cor- 
pus. From column two, we see that discourse markers start 
4202 of the 6163 utterances in the corpus, or 68.2%. If 
we exclude turn-initial filled pauses and acknowledgments 
and exclude turns that consist of only filled pauses and
discourse markers, we see that 44.1% of the speaker turns are
marked with a non-acknowledgment discourse marker.
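
The two percentages quoted above follow directly from the counts in Table 5, as the short calculation below shows. The dictionary layout and variable names are choices made for this sketch; the numbers are the table's.

```python
# Counts from Table 5: turns by the tag of their first word.
all_turns = {"AC": 3040, "CC_D": 824, "RB_D": 63, "UH_D": 275,
             "UH_FP": 462, "Other": 1499}
# Counts after skipping turn-initial acknowledgments and filled pauses.
excluding = {"CC_D": 1414, "RB_D": 154, "UH_D": 302, "Other": 2373}

# Turns that start with any discourse marker (filled pauses are not markers).
dm_initial = sum(all_turns[t] for t in ("AC", "CC_D", "RB_D", "UH_D"))
share_all = 100 * dm_initial / sum(all_turns.values())

# Turns marked with a non-acknowledgment discourse marker.
non_ack = sum(excluding[t] for t in ("CC_D", "RB_D", "UH_D"))
share_non_ack = 100 * non_ack / sum(excluding.values())

print(f"{dm_initial} of {sum(all_turns.values())} turns: {share_all:.1f}%")
print(f"{non_ack} of {sum(excluding.values())} turns: {share_non_ack:.1f}%")
```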



4 See Heeman (1997) for a derivation of the recall and precision
rates.









Turns that          Number    Excluding initial
start with                    AC's and UH_FP's
AC                  3040      n.a.
CC_D                824       1414
RB_D                63        154
UH_D                275       302
UH_FP               462       n.a.
Other               1499      2373
Total               6163      4243

Table 5: Discourse markers in turn-initial position



Restate A restatement of either the plan or facts in the 
world that have been explicitly stated before. 

Summarize Plan A restatement of the current working 
plan where this plan has been previously built up in 
pieces but has not been previously stated in its entirety. 

Request for summary Typically questions about the total 
time the plan will take, such as "what's the total on that." 

Conclude Explicit conclusion about the planning state that 
has not been stated previously, e.g. 'So that's not enough 
time' or 'So we have thirteen hours' 

Elaborate Plan Adding new plan steps onto the plan, e.g. 
"How about if we bring engine two and two boxcars from 
Elmira to Corning" 

Correction Correcting either the plan or a misconception 
of the other speaker. 

Respond to new info Explicit acknowledgment of new in- 
formation, such as "oh really" or "then let's do that". 

Table 6: Conversational move categories 



                      Turns beginning with
Conversational Move   And   Oh    So    Well
Restate                           6
Summarize Plan        5           4
Request for summary   1           3
Conclude                          15
Elaborate Plan        22
Correction                              7
Respond to new info         17

Table 7: Correlations with conversational move



man 1997b), we investigated the role that discourse markers
play in task-oriented human-human dialogs. We investigated
Schiffrin's claim that discourse markers can be used to
express the relationship between the information in the
upcoming utterance and the information in the discourse state
(Schiffrin 1987). For each turn that began with a discourse
marker, we coded the type of conversational move that the
discourse marker introduced. The conversational move
annotations, described in Table 6, attempt to capture speaker
intent rather than the surface form of the utterance. We
annotated five of the Trains dialogs, containing a total of 401
speaker turns and 24.5 minutes of speech.

In accordance with Schiffrin, we found that utterances
that summarize information are likely to be introduced with
"so", utterances that add on to the speaker's prior contribution
(and perhaps ignore the other conversant's intervening
contribution) are likely to be introduced with "and", and
utterances that express dissent with the information in the
discourse state are likely to be introduced with "well".
Table 7 summarizes the co-occurrence of turn-initial discourse
markers with the conversational moves that they introduce.



Acknowledge Backchannel 'Okay' or 'mm-hm'. 

Check Restating old information to elicit a positive re- 
sponse from the partner (e.g. That was three hours to 
Bath?). 

Confirm Restating old information, with no apparent in- 
tention of partner agreement. 

Filled Pause A turn containing no information such as 
'hm'. 

Inform Information not previously made explicit.

Request Request for information.

Respond Respond to a Request.

Y/N Question Questions requiring a yes/no answer. Differ 
from Check because the speaker displays no bias toward 
which answer he expects. 

Y/N Answer Answering 'yes', 'no', 'right', etc. 

Table 8: Speech Act annotations 





                      Total   Turn begins with        DM Turns
                      Turns   And   Oh    So    Well  % of Total
Prior speech act initiates adjacency pair
  Check               23                        1     4%
  Request Info        45                  1           2%
  Y/N Question        8                               0%
Prior speech act concludes adjacency pair
  Respond             38      3     2     5     1     30%
  Y/N Answer          26      1     1     1           12%
  Acknowledge         107     21    4     16    2     40%
Prior speech act not in adjacency pair
  Confirm             42      2                 1     7%
  Inform              96      1     10    5     2     19%
  Filled Pause        6                               0%

Table 9: Prior speech act of DM-initial turns



Table 7 shows that different discourse markers correlate
strongly with particular conversational moves. Because
discourse markers are found in turn-initial position, they
can be used as a timely indicator of the conversational move
about to be made.

A more traditional method for analyzing the function of 
turns in a dialog is to focus on their surface form by cate- 
gorizing them into speech acts, so we wanted to see if this 
sort of analysis would reveal anything interesting about
discourse marker usage in the Trains dialogs. Table 8 defines
the speech acts that were used to annotate the dialogs. We
found that discourse markers on the whole do not correlate
strongly with particular speech acts, as they did with
conversational moves. This is corroborated by Schiffrin's
(1987) corpus analysis, in which she concluded that
turn-initiators reveal little about the construction of the
upcoming turn. Although not correlating with syntactic
construction, discourse markers do interact with the local discourse
structure property of adjacency pairs (Schegloff & Sacks
1973). In an adjacency pair, such as Question/Answer or
Greeting/Greeting, the utterance of the first speech act of 
the pair sets up an obligation for the partner to produce the 
second speech act of the pair. After the first part of an adja- 
cency pair has been produced, there is a very strong expec- 
tation about how the next turn will relate to the preceding 
discourse, e.g. it will provide an answer to the question just 
asked. 

Since discourse markers help speakers signal how the 
current turn relates to prior talk, we decided to investigate 
what speech acts discourse markers tend to follow and how 
they correlate with adjacency pairs. Table 9 shows the prior
speech act of turns beginning with discourse markers. The
speech acts have been organized into those that form the
first part of an adjacency pair (Request Info, Y/N Question,
and Check), those that form second-pair-parts (Respond,
Y/N Answer, and Acknowledge), and those that are not
part of an adjacency pair sequence (Confirm, Inform, and 
Filled Pause). The table reveals the very low frequency of 
discourse marker initial turns after the initiation of an ad- 
jacency pair. After an adjacency pair has been initiated, 
the next turn almost never begins with a discourse marker, 
because the turn following the initiation of an adjacency 
pair is expected to be the completion of the pair. Since the 
role of that turn is not ambiguous, it does not need to begin 
with a discourse marker to mark its relationship to preced- 
ing talk. It would indeed be odd if after a direct question 
such as "so how many hours is it from Avon to Dansville" 
the system responded "and 6" or "so 6". A possible ex- 
ception would be to begin with "well" if the upcoming ut- 
terance is a correction rather than an answer. There is one 
"so" turn in the annotated dialogs after a Request act, but it 
is a request for clarification of the question. 

After a turn that is not the initiation of an adjacency pair, 
such as Acknowledge, Respond, or Inform, the next turn 
has a much higher probability of beginning with a discourse 
marker. Also when the prior speech act concludes an adja- 
cency pair, the role of the next statement is ambiguous, so 
a discourse marker is used to mark its relationship to prior 
discourse. 
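
The contrast described above can be checked by pooling the rows of Table 9 within each group. In the sketch below each row is stored as (total turns, turns beginning with "and", "oh", "so" or "well"), where the second number is the sum of the four marker columns as read from the table; the grouping labels and variable names are choices made for this illustration.

```python
# (total turns, turns beginning with And/Oh/So/Well), per prior speech act.
table9 = {
    "initiates adjacency pair": {
        "Check": (23, 1), "Request Info": (45, 1), "Y/N Question": (8, 0)},
    "concludes adjacency pair": {
        "Respond": (38, 11), "Y/N Answer": (26, 3), "Acknowledge": (107, 43)},
    "not in adjacency pair": {
        "Confirm": (42, 3), "Inform": (96, 18), "Filled Pause": (6, 0)},
}


def pooled_rate(rows):
    """Percentage of following turns that begin with one of the four markers."""
    total = sum(t for t, _ in rows.values())
    marked = sum(m for _, m in rows.values())
    return 100 * marked / total


rates = {group: pooled_rate(rows) for group, rows in table9.items()}
for group, rate in rates.items():
    print(f"prior speech act {group}: {rate:.1f}%")
```

Pooled this way, fewer than 3% of turns that follow the first part of an adjacency pair begin with one of these markers, against roughly a third of turns that follow the conclusion of a pair.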

In this section, we demonstrated that the choice of dis- 
course marker gives evidence as to the type of conversa- 
tional move that the speaker is about to make. Further- 
more, discourse markers are more likely to be used where 
there are not strong expectations about the utterance that the 
speaker is about to make. Thus, discourse markers provide 
hearers with timely information as to how the upcoming 
speech should be interpreted. 

Usefulness of Discourse Markers 

We have also shown that discourse markers can be reliably
identified in task-oriented spontaneous speech. The results
given in the previous section show that knowledge of the
discourse marker leads to strong expectations of the speech
that will follow. However, none of the work in using machine
learning techniques to predict the speech act of the
user's speech has used the presence of a discourse marker.
Chu-Carroll (1998) examined the syntactic type of the
utterance and turn-taking information, but not the presence of a
discourse marker. The work of Taylor et al. (1997) on using
prosody to identify discourse act type also ignores the
presence of discourse markers. The work of Stolcke et al. (1998)
also ignores them. As Dahlback and Jonsson (1992) observed,
it might be that speakers drop the usage of discourse
markers in talking with computer systems, but this might be
more of an effect of the current abilities of such systems and
user perceptions of them, rather than that people will not
want to use these as their perception of computer dialogue
systems increases. A first step in this direction is to make
use of these markers in dialogue comprehension. Machine
learning algorithms of discourse acts are ideally suited for
this task.
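
One way such a learner could be given access to discourse markers is through a turn-initial marker feature computed from the POS-tagged words. The sketch below is only an illustration of that idea: the feature names, the decision to skip leading filled pauses and acknowledgments, and the input format of (word, tag) pairs are assumptions, not part of any of the systems cited above.

```python
NON_ACK_DM_TAGS = {"UH_D", "CC_D", "RB_D"}
SKIPPED_TAGS = {"UH_FP", "AC"}  # filled pauses and acknowledgments


def turn_initial_marker(tagged_turn):
    """Return the non-acknowledgment discourse marker that opens the turn, if any.

    tagged_turn is a list of (word, POS tag) pairs. Leading filled pauses and
    acknowledgments are skipped; None is returned when the first remaining
    word is not tagged as a discourse marker.
    """
    for word, tag in tagged_turn:
        if tag in SKIPPED_TAGS:
            continue
        return word.lower() if tag in NON_ACK_DM_TAGS else None
    return None


def turn_features(tagged_turn):
    """Hypothetical feature dictionary for a dialog act classifier."""
    marker = turn_initial_marker(tagged_turn)
    return {"initial_dm": marker or "none", "num_words": len(tagged_turn)}


example = [("um", "UH_FP"), ("okay", "AC"), ("so", "CC_D"),
           ("we", "PRP"), ("have", "VBP"), ("thirteen", "CD"), ("hours", "NNS")]
features = turn_features(example)
```

The resulting dictionary could be combined with whatever syntactic or prosodic features a given classifier already uses.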

Conclusion 

In this paper, we have shown that discourse markers can 
be identified very reliably in spoken dialogue by view- 
ing the identification task as part of the process of part- 
of-speech tagging and using a Markov model approach to 
identify them. The identification process can be incorpo- 
rated into speech recognition, and this leads to a small re- 
duction in both the word perplexity and POS tagging error 
rate. Incorporating other aspects of spontaneous speech, 
namely speech repair resolution and identification of in- 
tonation phrase boundary tones, leads to further improve- 
ments in our ability to identify discourse markers. 

Our method for identifying discourse markers views this 
task as part of the speech recognition problem along with 
POS tagging. As such, rather than classifying each po- 
tential word independently as to whether it is a discourse 
marker or not (cf. (Litman 1996)), we find the best
interpretation for the acoustic signal, which includes identifying
the discourse markers. Using this approach means that the 
probability distributions that need to be estimated are more 
complicated than those traditionally used in speech recog- 
nition language modeling. Hence, we make use of a deci- 
sion tree algorithm to partition the training data into equiv- 
alence classes from which the probability distributions can 
be computed. 

Automatically identifying discourse markers early in the 
processing stream means that we can take advantage of their 
presence to help predict the following speech. In fact, we 
have shown that discourse markers not only can be used to 
help predict how the speaker's subsequent speech will build 
on to the discourse state, but also are often used when there 
are not already strong expectations, in terms of adjacency
pairs. However, most current spoken dialogue systems ignore
their presence, even though they can be easily incorporated
into existing machine learning algorithms that predict
discourse act types. 